Complexities of Human Promoter Sequences
By means of the diffusion entropy approach, we detect the scale-invariance
characteristics embedded in 4737 human promoter sequences. The exponent of
the scale invariance lies in a wide range of [0.3, 0.9], centered at
$\delta_c = 0.66$. The distribution of the exponent can be separated into
left and right branches with respect to the maximum. The two branches are
asymmetric, and each can be fitted well by a Gaussian form, with different
widths.
I. INTRODUCTION

Understanding gene regulation is one of the most exciting topics in
molecular genetics [1]. Promoter sequences are crucial in gene regulation.
The analysis of these regions is the first step towards complex models of
regulatory networks.

A promoter is a combination of different regions with different functions
[2, 3, 4, 5]. Surrounding the transcription start site is the minimal
sequence for initiating transcription, called the core promoter. It
interacts with RNA polymerase II and basal transcription factors. A few
hundred base pairs upstream of the core promoter are the gene-specific
regulatory elements, which are recognized by transcription factors that
determine the efficiency and specificity of promoter activity. Far from
the transcription start site there are enhancers and distal promoter
elements, which can considerably affect the rate of transcription.
Multiple binding sites contribute to the functioning of a promoter, with
their position and context of occurrence playing an important role.
Large-scale studies show that repeats participate in the regulation of
numerous human and mouse genes [6]. Hence, the promoter's biological
function is a cooperative process of different regions such as the core
promoter, the gene-specific regulatory elements, the enhancers/silencers,
the insulators, the CpG islands and so forth. How these regions cooperate
with each other remains a problem to be investigated carefully.

The structures of DNA sequences determine their biological functions [7].
Recent years have witnessed an avalanche of findings of nontrivial
structural characteristics embedded in DNA sequences. Detailed works show
that the non-coding sequences carry long-range correlations [8, 9, 10].
The size distributions of coding sequences and non-coding sequences obey
Gaussian or exponential forms and power laws [11, 12], respectively.
Theoretical model-based simulations [13, 14, 15, 16] tell us that the
parts of the promoters where RNA transcription starts are more active
than a random portion of the DNA. By means of a nonlinear modeling method
it is found that along the putative promoter regions of human sequences
some segments are much more predictable than others [17]. All this
evidence suggests that the nontrivial structural characteristics of a
promoter determine its biological functions. The statistical properties
of a promoter may shed light on the cooperative process of different
regions.

Experimental knowledge of the precise 5' ends of cDNAs should facilitate
the identification and characterization of regulatory sequence elements
in proximal promoters [18]. Using the oligo-capping method, Suzuki et al.
identify the transcriptional start sites from cDNA libraries enriched in
full-length cDNA sequences. The identified transcriptional start sites
are available at the database http://dbtss.hgc.jp/ [19]. Subsequently,
Leonardo et al. have used this data set and aligned the full-length cDNAs
to the human genome, thereby extracting putative promoter regions (PPRs)
[20]. Using the known transcriptional start sites from over 5700
different human full-length cDNAs, a set of 4737 distinct PPRs is
extracted from the human genome. Each PPR consists of the nucleotides
from -2000 to +1000 bp relative to the corresponding transcriptional
start site, i.e., 3001 positions. They have also counted eight-letter
words within the PPRs, using z-scores and other related statistics to
evaluate over- and under-representation.

In this paper, by means of the concept of diffusion 
entropy (DE) we try to detect the scale-invariant char- 
acteristics in these putative promoter regions. 



II. DIFFUSION ENTROPY ANALYSIS 
The diffusion entropy (DE) method was first designed to capture the scale
invariance embedded in time series [21, 22, 23]. To keep the description
as self-contained as possible, we briefly review the procedure.

We consider a PPR denoted by $Y = (y_1, y_2, \ldots, y_{3001})$, where
$y_s$ is the element at position $s$ and $y_s$ = A, T, C or G. Replacing
A, T and C, G with $-1$ and $+1$, respectively, the original PPR is
mapped to a time series $X = (x_1, x_2, \ldots, x_{3001})$. There is no
trend in this series, i.e., $X$ is stationary.
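The mapping from nucleotides to a two-valued series can be sketched in a few lines of Python. This is a minimal illustration, assuming the PPR is given as a plain string; the names `STEP` and `ppr_to_series` are hypothetical and do not come from the paper.

```python
# Mapping stated in the text: A, T -> -1 and C, G -> +1.
STEP = {"A": -1, "T": -1, "C": 1, "G": 1}


def ppr_to_series(ppr):
    """Map a nucleotide string to a list of -1/+1 steps."""
    return [STEP[base] for base in ppr.upper()]


series = ppr_to_series("ATGCGCTA")
print(series)  # [-1, -1, 1, 1, 1, 1, -1, -1]
```

Characters other than A, T, C and G raise a `KeyError` in this sketch; how ambiguous bases were treated in the original data set is not stated in the text.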

Connecting the start and the end of $X$, we obtain a set of
delay-register vectors, which read

$$
\begin{aligned}
T_1(t) &= (x_1, x_2, \ldots, x_t), \\
T_2(t) &= (x_2, x_3, \ldots, x_{t+1}), \\
&\;\;\vdots \\
T_{3001}(t) &= (x_{3001}, x_1, \ldots, x_{t-1}).
\end{aligned}
\qquad (1)
$$
Regarding each vector as the trajectory of a particle over a duration of
$t$ time units, the set of vectors can be described as a diffusion
process of a system containing 3001 particles. The initial state of the
system is $\left(T_1(0), T_2(0), \ldots, T_{3001}(0)\right)^{\mathsf{T}}$.



Accordingly, at each time step $t$ we can calculate the displacements of
all the particles, i.e., the sum of the elements of each delay-register
vector. The probability distribution function (PDF) of the displacements
can be approximated by $p(m,t) \sim K_m/3001$, where
$m = -t, -t+1, \ldots, t$ and $K_m$ is the number of particles whose
displacement is $m$. This PDF represents the state of the system at
time $t$.
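The construction of the displacement PDF can be illustrated as follows. This is a sketch under the assumptions that the series is held in a Python list and that the window wraps around the end of the series, as in Eq. (1); the function name `displacement_pdf` is hypothetical.

```python
from collections import Counter


def displacement_pdf(x, t):
    """Return {m: p(m, t)} for windows of length t on the circular series x."""
    n = len(x)
    # Sum of the first window x[0], ..., x[t-1].
    current = sum(x[i % n] for i in range(t))
    counts = Counter()
    for i in range(n):
        counts[current] += 1
        # Slide the window by one position, wrapping around the end.
        current += x[(i + t) % n] - x[i]
    return {m: k / n for m, k in counts.items()}


x = [-1, -1, 1, 1, 1, 1, -1, -1]
pdf = displacement_pdf(x, 2)
print(sorted(pdf.items()))  # [(-2, 0.375), (0, 0.25), (2, 0.375)]
```

The sliding update keeps the cost at one addition and one subtraction per window, so the PDF at a given $t$ is obtained in a single pass over the series.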

FIG. 1: (Color online) Typical DE results. The results for the PPRs
numbered 1, 1000, 2000 and 3000 are presented. Over considerably wide
ranges of $t$, the DE curves can be fitted almost exactly with the linear
relation in Eq. (4).

As a tenet of complexity theory [24], complexity is related to the
concept of scaling invariance. For the constructed diffusion process, the
scaling invariance is defined as

$$p(m,t) = \frac{1}{t^{\delta}}\, F\!\left(\frac{m}{t^{\delta}}\right), \qquad (2)$$

where $\delta$ is the scaling exponent and can be regarded as a
quantitative description of the PPR's complexity. If the elements in the
PPR are positioned randomly, the resulting PDF obeys a Gaussian form and
$\delta = 0.5$. Complexity of the PPR is expected to generate a departure
from this ordinary condition, that is, $\delta \neq 0.5$.

The value of $\delta$ can tell us the pattern characteristics of a PPR.
The departure from the ordinary condition can be described with a
preferential effect. Suppose the current element is A or T (or C or G);
the probability that the following element is also A or T (or C or G) is
the preferential probability $W_{\mathrm{pre}}$. A positive preferential
effect, i.e., $W_{\mathrm{pre}} > 0.5$, leads to a value of $\delta$
larger than 0.5, while a negative preferential effect, i.e.,
$W_{\mathrm{pre}} < 0.5$, can induce a value of $\delta$ smaller than 0.5.
Hence, a large value of $\delta$ implies that A, T or C, G, respectively,
accumulate strongly in a scale-invariant way.

However, correct evaluation of the scaling exponent is a nontrivial
problem. In the literature, the variance-based method is used to detect
scale invariance. But the Hurst exponent $H$ obtained in this way may
differ from the real $\delta$, that is, generally $H \neq \delta$.
Moreover, under some conditions the variance diverges, which invalidates
the variance-based method altogether. To overcome these shortcomings, the
Shannon entropy of the diffusion process can be used, which reads

$$S(t) = -\sum_{m=-t}^{t} p(m,t)\ln p(m,t) = -\sum_{m=-t}^{t} \frac{K_m}{3001}\ln\frac{K_m}{3001}. \qquad (3)$$



This diffusion-based entropy is called diffusion entropy (DE). A simple
computation leads to the relation between the scaling invariance defined
in Eq. (2) and the DE,

$$S(t) = A + \delta \ln t, \qquad (4)$$

where $A$ is a constant that depends on the PDF. Detailed works show that
DE is a reliable method for finding the correct value of $\delta$,
regardless of the form of the PDF [26, ...].
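A minimal sketch of how Eqs. (3) and (4) might be evaluated numerically is given below. It assumes a circular series of $\pm 1$ steps and an ordinary least-squares fit of $S(t)$ against $\ln t$ over a hand-picked set of scales; the function names, the surrogate series and the chosen scales are illustrative assumptions, not the authors' procedure.

```python
import math
import random
from collections import Counter


def diffusion_entropy(x, t):
    """Shannon entropy S(t) of the window sums of length t (circular series)."""
    n = len(x)
    current = sum(x[i % n] for i in range(t))
    counts = Counter()
    for i in range(n):
        counts[current] += 1
        current += x[(i + t) % n] - x[i]
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def fit_delta(x, t_values):
    """Least-squares slope of S(t) against ln t, an estimate of delta."""
    xs = [math.log(t) for t in t_values]
    ys = [diffusion_entropy(x, t) for t in t_values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


# Uncorrelated surrogate of the same length as a PPR (3001 positions).
random.seed(0)
surrogate = [random.choice((-1, 1)) for _ in range(3001)]
delta = fit_delta(surrogate, [4, 8, 16, 32, 64])
print(round(delta, 2))  # expected to lie near 0.5 for an uncorrelated series
```

For such a short series the entropy at large $t$ is estimated from few effectively independent windows, so the fitted slope fluctuates around 0.5 rather than matching it exactly.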

The complexity in the PDF can be classified into two levels [30]: the
primary one, due to the extension of the probability to all the possible
displacements, and the secondary one, due to the internal structures.
Consequently, we should also consider the corresponding shuffled
sequences as a comparison.
FIG. 2: Distribution of the maximum interval $\Delta t$ in which one can
find scale-invariant characteristics. Requiring the standard deviation
of the fit to be below 0.05, we find the maximum interval $\Delta t$ for
each PPR. The distribution tells us that the scale invariance can
generally be found over two to three decades of the scale $t$.
FIG. 3: (Color online) The complexity index $\delta$ is distributed over
a wide range of [0.3, 0.9]. The distribution can be separated into two
branches with respect to the center $\delta_c = 0.66$. The two branches
are asymmetric, and each obeys a Gaussian function. The widths and
centers of the left and right branches are
$(w_{\mathrm{left}}, x_c^{\mathrm{left}}) = (0.17, 0.67)$ and
$(w_{\mathrm{right}}, x_c^{\mathrm{right}}) = (0.10, 0.65)$. The centers
coincide with each other,
$x_c^{\mathrm{left}} \approx x_c^{\mathrm{right}} \approx \delta_c = 0.66$.
The right branch is distributed over a significantly narrower region.



III. RESULTS AND DISCUSSIONS 

The DEs for all 4737 PPRs are calculated. As a typical example, Fig. 1
presents the DE results for the PPRs numbered 1, 1000, 2000 and 3000.
Over considerably wide ranges of $t$, the DE curves can be fitted almost
exactly with the linear relation in Eq. (4).

For each PPR, there exists an interval, $t_0 \sim t_0 + \Delta t$, in
which the PDF exhibits scale invariance. Requiring simultaneously that
the standard deviation of the fit be below 0.05 and that the error of the
scaling exponent be below 0.03, we find the maximum interval $\Delta t$
for each PPR. In the fitting procedure, the confidence level is set to
95%. The distribution of $\Delta t$, shown in Fig. 2, tells us that the
scale invariance can generally be found over two to three decades of the
scale $t$. The concept of DE is based upon statistical theory, that is,
$t_0$ should be large enough for the statistical assumptions to be valid.
As an example, consider a random series whose elements obey a uniform
distribution on [0,1]. Only when the length $t$ of the delay-register
vectors in Eq. (1) is large enough does the PDF of the displacements,
i.e., of the sum of each delay-register vector, approach the Gaussian
distribution. Consequently, $t_0$ is not an informative parameter, and
its values for the different PPRs are not presented.
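One way the search for the widest scaling interval could be organized is sketched below. It is an illustration under stated assumptions: "standard deviation of the fit" is taken to mean the residual standard deviation of an ordinary least-squares fit of $S$ against $\ln t$, "error of the scaling exponent" is taken to mean the standard error of the slope, and the 95% confidence level mentioned in the text is not modeled. All function and parameter names are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Return slope, residual standard deviation and slope standard error."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    resid_std = math.sqrt(ssr / (n - 2))
    return slope, resid_std, resid_std / math.sqrt(sxx)


def widest_scaling_window(t_values, entropies, max_std=0.05, max_err=0.03,
                          min_points=4):
    """Find the window of largest ln-t extent whose fit passes both thresholds.

    Returns (span in ln t, first t, last t, slope) or None if no window passes.
    """
    xs = [math.log(t) for t in t_values]
    best = None
    for i in range(len(xs)):
        for j in range(i + min_points, len(xs) + 1):
            slope, std, err = linear_fit(xs[i:j], entropies[i:j])
            if std < max_std and err < max_err:
                span = xs[j - 1] - xs[i]
                if best is None or span > best[0]:
                    best = (span, t_values[i], t_values[j - 1], slope)
    return best


# Synthetic DE curve: linear in ln t with slope 0.66 up to t = 128, then flat.
ts = [2 ** k for k in range(1, 11)]
ss = [1.0 + 0.66 * math.log(min(t, 128)) for t in ts]
best = widest_scaling_window(ts, ss)
print(best[1], best[2], round(best[3], 2))  # 2 128 0.66
```

The exhaustive scan over all windows is quadratic in the number of scales, which is negligible when $S(t)$ is sampled at a few dozen logarithmically spaced values of $t$.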

The resulting scaling exponent $\delta \pm 0.03$ is distributed over a
wide range of [0.3, 0.9]. The distribution can be separated into two
branches with respect to the center $\delta_c = 0.66$. The two branches
are asymmetric, and each can be fitted well by a Gaussian function. The
widths and centers of the left and right branches are
$(w_{\mathrm{left}}, x_c^{\mathrm{left}}) = (0.17, 0.67)$ and
$(w_{\mathrm{right}}, x_c^{\mathrm{right}}) = (0.10, 0.65)$. That is to
say, the centers coincide with each other,
$x_c^{\mathrm{left}} \approx x_c^{\mathrm{right}} \approx \delta_c = 0.66$.
Comparatively, the right branch is distributed over a significantly
narrower region.

The PPRs are also shuffled. For each PPR, the shuffling result is
obtained by averaging over ten shuffled samples. The scaling exponents
of the shuffled sequences are almost the same, i.e.,
$\delta_{\mathrm{shuffling}} = 0.5 \pm 0.03$. The detected scale-invariant
characteristics are therefore related to the internal structure of the
sequences.
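The shuffling comparison can be illustrated with the following sketch. It assumes a least-squares slope of $S(t)$ against $\ln t$ as the exponent estimate and uses an artificial block-structured series in place of a real PPR; the function names, the test series and the scales are illustrative assumptions only.

```python
import math
import random
from collections import Counter


def de_slope(x, t_values):
    """Least-squares slope of the diffusion entropy S(t) against ln t."""
    n = len(x)
    xs, ys = [], []
    for t in t_values:
        current = sum(x[i % n] for i in range(t))
        counts = Counter()
        for i in range(n):
            counts[current] += 1
            current += x[(i + t) % n] - x[i]  # circular sliding window
        xs.append(math.log(t))
        ys.append(-sum((k / n) * math.log(k / n) for k in counts.values()))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sum((a - mx) ** 2 for a in xs)


def shuffled_slope(x, t_values, samples=10, seed=0):
    """Average the slope over several random shuffles of x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        y = list(x)
        rng.shuffle(y)
        total += de_slope(y, t_values)
    return total / samples


# Artificial series with strong internal structure: alternating blocks of 50.
structured = ([-1] * 50 + [1] * 50) * 30
scales = [4, 8, 16, 32]
original = de_slope(structured, scales)
shuffled = shuffled_slope(structured, scales)
print(round(original, 2), round(shuffled, 2))
```

Shuffling preserves the composition of the series but destroys the ordering, so a slope that drops towards 0.5 after shuffling points to structure in the ordering rather than in the base composition.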

How to understand the asymmetry of the distribution of the complexity
index $\delta$ is an interesting problem. In the literature, some
statistical characteristics of DNA sequences, such as the long-range
correlations and the over- and under-representation of strings, are
captured with evolution models [31, 32, 33]. From the perspective of
evolution, the distribution characteristics may favor a stochastic
evolution model. The initial sequences have the same complexity,
$\delta_{\mathrm{initial}} = \delta_c = 0.66$. During evolution the
sequences diffuse along two directions, increasing complexity and
decreasing complexity, i.e., the index $\delta$ increases and decreases,
respectively. The diffusion coefficients for the two directions are
significantly different, $D_{\mathrm{left}} \neq D_{\mathrm{right}}$.
Based upon the widths of the two branches we can estimate that
$D_{\mathrm{left}}/D_{\mathrm{right}} = w_{\mathrm{left}}/w_{\mathrm{right}} = 1.7$.
It should be noted that complexity is regarded as the departure from the
ordinary condition, $\delta = 0.5$. Of the 4737 values of $\delta$, only
a small portion are less than 0.5. Accordingly, the PPRs may be
classified into two classes, those with high complexity and those with
low complexity. The former class evolves on average at a slow speed,
while the latter evolves at a high speed.

In summary, by means of the DE method, we calculate the complexities of
the 4737 PPRs. The distribution of the complexity index consists of two
asymmetric branches, each of which obeys a Gaussian form, with different
widths. A stochastic evolution model may provide a comprehensive
understanding of these characteristics.